The white wine dataset contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). The question of interest in this analysis is: which chemical properties influence the quality of white wines? The analysis follows this sequence: univariate analysis -> bivariate analysis -> multivariate analysis -> final plots -> reflection.

Univariate Plots Section

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
## [1] 4898   13

The data set has 4898 observations and 13 variables: a row index X, 11 chemical variables (all numeric), and the dependent variable quality (integer).
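The output above is consistent with calls such as names(), str(), summary() and dim(). A minimal sketch of that inspection step follows. The three rows are copied from the str() output above; the data frame name wq comes from the later test output, while the CSV file name in the comment is an assumption, since the report does not state it.

```r
# Toy stand-in for the wine data: the first three rows of a few columns,
# copied from the str() output above. With the real file one would use
# something like:
#   wq <- read.csv("wineQualityWhites.csv")   # file name is an assumption
wq <- data.frame(
  X              = 1:3,
  fixed.acidity  = c(7.0, 6.3, 8.1),
  residual.sugar = c(20.7, 1.6, 6.9),
  alcohol        = c(8.8, 9.5, 10.1),
  quality        = c(6L, 6L, 6L)
)

names(wq)     # column names
str(wq)       # type of each column
summary(wq)   # min, quartiles, mean and max per column
dim(wq)       # number of rows and columns

# X is only a row index, so drop it before any analysis
wq$X <- NULL
```

Dropping X early keeps it from showing up as a spurious variable in later correlation tables.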

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

The distribution of quality is close to normal. Quality scores range from 3 to 9, and most fall between 5 and 7. The median score is 6 and the mean is 5.88.

The distribution of fixed.acidity is approximately normal. Most fixed.acidity values fall between 6 and 9.

The distribution of volatile.acidity is approximately normal. Most values fall in [0.1, 0.5]. There are a few outliers above 0.6.

The distribution of residual.sugar is right-skewed, so I use a log10 transformation to see it more clearly.

The distribution of log10(residual.sugar) is bimodal.
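A sketch of the log10 step in base R is shown here. The values are simulated from a log-normal distribution as a stand-in for wq$residual.sugar, so they are right-skewed but will not reproduce the bimodal shape seen in the real data. The plotting calls are one possible choice, not necessarily the ones used for this report.

```r
set.seed(1)
# Simulated right-skewed values standing in for wq$residual.sugar (not real data)
sugar <- rlnorm(500, meanlog = 1.5, sdlog = 0.9)

op <- par(mfrow = c(1, 2))            # two panels side by side
hist(sugar, breaks = 40, main = "raw", xlab = "residual.sugar")
hist(log10(sugar), breaks = 40, main = "log10", xlab = "log10(residual.sugar)")
par(op)                               # restore the previous layout

# Quick skew check without extra packages:
# a mean well above the median points to a long right tail
c(mean = mean(sugar), median = median(sugar))
c(mean = mean(log10(sugar)), median = median(log10(sugar)))
```

On the log10 scale the mean and median sit much closer together, which is why the transformed histogram is easier to read.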

The distribution of chlorides is approximately normal. Most values fall in [0, 0.1]. There are a few outliers above 0.1.

The distribution of pH is approximately normal. The median pH is 3.18.

The distribution of alcohol is multimodal. Alcohol ranges over [8, 14.2].

The distribution of citric.acid is very close to normal, but there is a sharp spike near 0.5. I am curious about the cause, so in the next section I will draw the density plot for each quality level to see what happens.

The distribution of free.sulfur.dioxide is approximately normal. Most values fall in [0, 100]. There are some outliers above 100.

The distribution of total.sulfur.dioxide is also approximately normal. Most values fall in [0, 250]. There are some outliers above 300.

The distribution of density is bimodal. Most values fall in [0.99, 1]. There are a few high outliers, up to the maximum of 1.039.

The distribution of sulphates is approximately normal. Most values fall in [0.4, 0.6].

Univariate Analysis

What is the structure of your dataset?

There are 4,898 white wines in the dataset with 11 attributes (fixed.acidity, volatile.acidity, chlorides, pH, citric.acid, residual.sugar, free.sulfur.dioxide, total.sulfur.dioxide, density, sulphates, alcohol). All of these attributes are continuous variables.

Quality scores in the dataset range from 3 to 9; the higher the score, the better the quality. Most scores fall between 5 and 7. The median score is 6 and the mean is 5.88.

The distributions of fixed.acidity, volatile.acidity, chlorides, pH, citric.acid, free.sulfur.dioxide, total.sulfur.dioxide and sulphates are normal or close to normal.

What is/are the main attributes of interest in your dataset?

I’d like to determine which chemical properties influence the quality of white wines. I do not yet know which variables are the most likely candidates, so in the next section I will first print the correlation table.

Did you create any new variables from existing variables in the dataset?

No, I didn’t.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Yes. Residual.sugar is right-skewed, so I applied a log10 transformation to see the distribution more clearly, and I found that log10(residual.sugar) is bimodal. The distribution of citric.acid is very close to normal, but there is a sharp spike near 0.5. I am curious about the cause, so in the next section I will draw the density plot for each quality level to see what happens.

Bivariate Plot Section

It seems that quality is correlated with alcohol and density, then with chlorides, total.sulfur.dioxide and volatile.acidity. There are also moderate to strong correlations between the independent variables: alcohol & density r=-0.78; alcohol & residual.sugar r=-0.45; density & residual.sugar r=0.84; density & total.sulfur.dioxide r=0.53; total.sulfur.dioxide & free.sulfur.dioxide r=0.62.
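A correlation table of this kind can be sketched with cor() in base R. The three columns below are simulated stand-ins, not the real wine data, with the signs of the relationships built in by construction. The name toy and all coefficients are illustrative assumptions.

```r
set.seed(2)
n <- 200
# Simulated stand-in columns (not the real wine data)
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
density <- 1.01 - 0.0015 * alcohol + rnorm(n, sd = 0.001)
quality <- round(3 + 0.3 * alcohol + rnorm(n, sd = 0.8))
toy <- data.frame(alcohol, density, quality)

r <- round(cor(toy), 2)   # Pearson correlation matrix (the default method)
r

# Rank the predictors by absolute correlation with quality,
# keeping the signed values in r for reporting
sort(abs(r["quality", colnames(r) != "quality"]), decreasing = TRUE)
```

Ranking by absolute value is convenient for picking candidates, but the sign should still be quoted when a correlation is reported.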

alcohol & density r=-0.78

## 
##  Pearson's product-moment correlation
## 
## data:  wq$density and wq$alcohol
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7908646 -0.7689315
## sample estimates:
##        cor 
## -0.7801376

There is a clear negative linear relationship between density and alcohol: as density increases, alcohol decreases. Pearson’s r = -0.780.

alcohol & residual.sugar r=-0.45

## 
##  Pearson's product-moment correlation
## 
## data:  wq$residual.sugar and wq$alcohol
## t = -35.321, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4726723 -0.4280267
## sample estimates:
##        cor 
## -0.4506312

We can see a negative relationship between residual.sugar and alcohol: wines with higher residual.sugar tend to have lower alcohol. Pearson’s r = -0.451.
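The signed estimate and its confidence interval can be read directly from the object returned by cor.test(), which avoids dropping the minus sign when copying values by hand. The sketch below uses simulated stand-ins for the two columns, so the numbers will differ from the output above.

```r
set.seed(3)
# Simulated stand-ins (not the real columns): more sugar, less alcohol
sugar   <- runif(300, min = 1, max = 20)
alcohol <- 12 - 0.1 * sugar + rnorm(300, sd = 1)

ct <- cor.test(sugar, alcohol)     # Pearson's product-moment test by default
unname(ct$estimate)                # signed r
ct$conf.int                        # 95 percent confidence interval
sprintf("r = %.3f", ct$estimate)   # formatted for the write-up, sign included
```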

density & residual.sugar r=0.84

## 
##  Pearson's product-moment correlation
## 
## data:  wq$residual.sugar and wq$density
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8304732 0.8470698
## sample estimates:
##       cor 
## 0.8389665

There is a clear positive linear relationship between density and residual.sugar. Pearson’s r = 0.839.

density & total.sulfur.dioxide r=0.53

## 
##  Pearson's product-moment correlation
## 
## data:  wq$total.sulfur.dioxide and wq$density
## t = 43.719, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5094349 0.5497297
## sample estimates:
##       cor 
## 0.5298813

We can see a positive linear relationship between density and total.sulfur.dioxide: as total.sulfur.dioxide increases, density increases. Pearson’s r = 0.530.

total.sulfur.dioxide & free.sulfur.dioxide r=0.62

## 
##  Pearson's product-moment correlation
## 
## data:  wq$total.sulfur.dioxide and wq$free.sulfur.dioxide
## t = 54.645, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5977994 0.6326026
## sample estimates:
##      cor 
## 0.615501

We can see a positive linear relationship between total.sulfur.dioxide and free.sulfur.dioxide: as total.sulfur.dioxide increases, free.sulfur.dioxide increases. Pearson’s r = 0.616.

quality & alcohol

## 
##  Pearson's product-moment correlation
## 
## data:  wq$alcohol and wq$quality
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4126015 0.4579941
## sample estimates:
##       cor 
## 0.4355747

For the low qualities 3 to 5, the alcohol median decreases as quality increases. For qualities 6 to 9, the alcohol median increases with quality. The highest quality has the highest alcohol median, and quality 5 has the lowest. Pearson’s r = 0.436 between quality and alcohol.
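Medians per quality level can be computed with tapply() and drawn with boxplot(). The sketch below uses a tiny hand-made data frame, not real wine data, purely to show the calls.

```r
# Tiny hand-made stand-in (not real wine data)
toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.2, 9.5, 9.8, 10.1, 10.5, 10.9, 11.0, 11.4, 12.1)
)

# Median alcohol within each quality level
med <- tapply(toy$alcohol, toy$quality, median)
med

# One box per quality level; the thick line in each box is the median
boxplot(alcohol ~ quality, data = toy, xlab = "quality", ylab = "alcohol")
```

Printing the medians alongside the boxplot makes statements such as "quality 5 has the lowest median" easy to verify.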

quality & density

## 
##  Pearson's product-moment correlation
## 
## data:  wq$density and wq$quality
## t = -22.581, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3322718 -0.2815385
## sample estimates:
##        cor 
## -0.3071233

The highest quality has the lowest density median. As quality increases, the density median tends to decrease. Pearson’s r = -0.307 between quality and density.

quality vs. chlorides

## 
##  Pearson's product-moment correlation
## 
## data:  wq$chlorides and wq$quality
## t = -15.024, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2365501 -0.1830039
## sample estimates:
##        cor 
## -0.2099344

The best quality has the lowest chlorides, and chlorides tend to decrease as quality increases. Pearson’s r = -0.210 between quality and chlorides.

quality vs. total.sulfur.dioxide

## 
##  Pearson's product-moment correlation
## 
## data:  wq$total.sulfur.dioxide and wq$quality
## t = -12.418, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2017563 -0.1474524
## sample estimates:
##        cor 
## -0.1747372

The group with the highest total.sulfur.dioxide has the lowest quality median. It seems that the lower the total.sulfur.dioxide, the higher the quality. Pearson’s r = -0.175 between quality and total.sulfur.dioxide.

quality vs. volatile.acidity

## 
##  Pearson's product-moment correlation
## 
## data:  wq$volatile.acidity and wq$quality
## t = -13.891, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2215214 -0.1676307
## sample estimates:
##       cor 
## -0.194723

The low qualities 3 and 4 have the largest volatile.acidity variance, and qualities 5 and 6 have the smallest. For qualities above 6 the medians are very close, while for the lower qualities 4 and 5 the medians are slightly larger. Pearson’s r = -0.195 between quality and volatile.acidity.

Bivariate Analysis

Talk about some of the relationship you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Quality correlates with alcohol (r=0.44) and density (r=-0.31), then chlorides (r=-0.21), volatile.acidity (r=-0.19) and total.sulfur.dioxide (r=-0.17).

Wines with higher alcohol are more likely to have high quality, and wines with lower density are more likely to have higher quality. Higher quality wines also tend to have lower chlorides and lower total.sulfur.dioxide.

Did you observe any interesting relationships between the other features?

Alcohol correlates strongly with density: as density increases, alcohol decreases. Alcohol is also correlated with residual.sugar: wines with higher residual.sugar have lower alcohol.

Density has a linear relationship with residual.sugar: as residual.sugar increases, density increases. Density is also correlated with total.sulfur.dioxide: as total.sulfur.dioxide increases, density increases.

Total.sulfur.dioxide is correlated with free.sulfur.dioxide: as total.sulfur.dioxide increases, free.sulfur.dioxide increases.

What was the strongest relationship you found?

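Across all pairs, the strongest relationship is between density and residual.sugar (r = 0.839). For quality, the strongest correlations are with alcohol (r = 0.436) and density (r = -0.307). Because alcohol and density are themselves strongly correlated (r = -0.780), I will use only alcohol in the regression.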

Multivariate Plot Section

quality vs. alcohol vs. chlorides

The highest quality covers a smaller range of alcohol: [10.3, 12.8]. Holding alcohol fixed, the highest quality has the lowest chlorides, and lower quality has higher chlorides.

quality vs. alcohol vs. volatile.acidity

Holding alcohol fixed, higher volatile.acidity seems to go with lower quality.

quality vs. alcohol vs. total.sulfur.dioxide

Holding alcohol fixed, it seems that higher total.sulfur.dioxide goes with higher quality.
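One simple way to approximate "holding alcohol fixed" without a model is to cut alcohol into bins and look at the second variable within each bin. The sketch below does this on simulated data in which chlorides lower quality by construction. The data, the bin edges and the names toy, alcohol.bin and r.by.bin are all illustrative assumptions, not the method used for the plots in this report.

```r
set.seed(6)
n <- 600
# Simulated stand-in (not real wine data); the effect sizes are arbitrary
toy <- data.frame(alcohol   = runif(n, min = 8, max = 14),
                  chlorides = runif(n, min = 0.02, max = 0.08))
toy$quality <- round(5.9 + 0.3 * (toy$alcohol - 10.5) -
                     20 * (toy$chlorides - 0.045) + rnorm(n, sd = 0.7))

# Bin alcohol so that wines within a bin have roughly similar alcohol
toy$alcohol.bin <- cut(toy$alcohol, breaks = c(8, 10, 12, 14),
                       include.lowest = TRUE)

# Correlation of chlorides with quality inside each alcohol bin
r.by.bin <- sapply(split(toy, toy$alcohol.bin),
                   function(d) cor(d$chlorides, d$quality))
round(r.by.bin, 2)
```

If the sign of the within-bin correlation is the same in every bin, the relationship is less likely to be an artifact of alcohol alone.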

Quality 9 bimodal

The density plots show something very interesting: quality 9 has a bimodal distribution for alcohol, fixed.acidity, residual.sugar, free.sulfur.dioxide, density and volatile.acidity. My guess is that the response variable quality is not a concrete quantity that can be measured correctly with a function or an instrument. The quality score is quite subjective, and different testers may apply different standards, which may lead to the bimodal distribution for the highest quality.
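Before reading much into a bimodal density for one quality level, it may be worth checking how many wines that level contains, because a kernel density estimate over a handful of points can easily show two bumps. The sketch below illustrates the check on a hand-made data frame. The values and the group sizes are made up for illustration and are not taken from the wine data.

```r
# Hand-made stand-in (not real wine data): a large group and a very small one
toy <- data.frame(
  quality = c(rep(6, 40), rep(9, 5)),
  alcohol = c(seq(9, 13, length.out = 40), 10.4, 10.5, 12.4, 12.5, 12.7)
)

table(toy$quality)   # group sizes first

# Kernel density estimate for the small group only
d9 <- density(toy$alcohol[toy$quality == 9])
plot(d9, main = "quality 9 (n = 5)", xlab = "alcohol")
rug(toy$alcohol[toy$quality == 9])   # mark the individual observations
```

With the real data, table(wq$quality) would give the corresponding counts per quality level.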

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wq)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wq)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates, 
##     data = wq)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     residual.sugar, data = wq)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     residual.sugar + fixed.acidity, data = wq)
## 
## ===========================================================================
##                        m1         m2         m3         m4         m5      
## ---------------------------------------------------------------------------
##   (Intercept)        2.582***   3.017***   2.803***   2.112***   2.672***  
##                     (0.098)    (0.098)    (0.110)    (0.125)    (0.159)    
##   alcohol            0.313***   0.324***   0.325***   0.376***   0.371***  
##                     (0.009)    (0.009)    (0.009)    (0.010)    (0.010)    
##   volatile.acidity             -1.979***  -1.963***  -2.091***  -2.103***  
##                                (0.110)    (0.110)    (0.109)    (0.109)    
##   sulphates                                0.416***   0.453***   0.443***  
##                                           (0.097)    (0.095)    (0.095)    
##   residual.sugar                                      0.027***   0.028***  
##                                                      (0.002)    (0.002)    
##   fixed.acidity                                                 -0.073***  
##                                                                 (0.013)    
## ---------------------------------------------------------------------------
##   R-squared              0.2        0.2        0.2        0.3        0.3   
##   adj. R-squared         0.2        0.2        0.2        0.3        0.3   
##   sigma                  0.8        0.8        0.8        0.8        0.8   
##   F                   1146.4      773.9      523.9      434.1      355.9   
##   p                      0.0        0.0        0.0        0.0        0.0   
##   Log-likelihood     -5839.4    -5681.8    -5672.5    -5610.8    -5594.8   
##   Deviance            3112.3     2918.3     2907.3     2834.9     2816.5   
##   AIC                11684.8    11371.6    11355.0    11233.6    11203.7   
##   BIC                11704.3    11397.5    11387.5    11272.6    11249.2   
##   N                   4898       4898       4898       4898       4898     
## ===========================================================================

The variables in the largest model (m5) account for about 30% of the variance in the quality of wine.
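The way such nested models are built and compared can be sketched in base R. The data below are simulated, with arbitrary coefficients, so the numbers will not match the table above. The table itself appears to come from a model-comparison helper, which this sketch does not try to reproduce.

```r
set.seed(4)
n <- 400
# Simulated stand-in for wq (not real data); coefficients are arbitrary
toy <- data.frame(alcohol          = rnorm(n, mean = 10.5, sd = 1.2),
                  volatile.acidity = rnorm(n, mean = 0.28, sd = 0.1))
toy$quality <- 2.6 + 0.3 * toy$alcohol - 2 * toy$volatile.acidity +
               rnorm(n, sd = 0.8)

m1 <- lm(quality ~ alcohol, data = toy)
m2 <- update(m1, . ~ . + volatile.acidity)   # add one predictor at a time

# R-squared cannot decrease when a predictor is added to a nested model
r2 <- sapply(list(m1 = m1, m2 = m2), function(m) summary(m)$r.squared)
round(r2, 3)

AIC(m1, m2)     # lower is better
anova(m1, m2)   # F-test for the added term
```

Because R-squared never goes down as terms are added, the AIC, BIC and F-test columns are the more useful guides to whether an extra variable is worth keeping.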

Multivariate Analysis

Talk about some of the relationship you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your features of interest?

Alcohol influences the quality of white wine most strongly, so I can build a linear model of quality on alcohol. The other variables I investigated in this section are related to quality, but not very strongly.

Were there any interesting or surprising interactions between features?

In the density plots, quality 9 has a bimodal distribution for most of the features, such as alcohol, fixed.acidity, residual.sugar, free.sulfur.dioxide and volatile.acidity. My guess is that the response variable quality is not a concrete quantity that can be measured correctly with a function or an instrument. The quality score is quite subjective, and different testers may apply different standards, which may lead to the bimodal distribution for the highest quality.

Did you create any models with your dataset? Discuss the strengths and limitation of your model

Yes, I created a linear model. The variables in the linear model account for only about 30% of the variance in the quality of wine. Although all the variables are significant, alcohol alone explains about 20% of the variance in quality, and the other 4 variables together add only about 10%. The model explains quality to some extent, but it is not a very good predictive model.

Final Plots and Summary

Plot One

Description One

The question we are answering is which features influence the quality of white wine, so it helps to start with the histogram of quality and get to know its distribution and summary statistics. The distribution of white wine quality appears to be normal. The mean quality is around 5.9. According to the boxplot below, the median quality is around 6. Most of the wines have a quality between 5 and 7.

Plot Two

Description Two

As the analysis in the previous sections shows, white wine quality is most strongly related to alcohol. From the plot we can see that the highest quality has the highest alcohol median. From quality 5 upward, the alcohol median increases with quality.

Plot Three

Description Three

While looking for relationships between quality and the other features, I found an interesting phenomenon. This density plot shows that quality 9 has a bimodal distribution for alcohol. The same is true for most of the other features: fixed.acidity, residual.sugar, free.sulfur.dioxide and volatile.acidity. My guess is that the response variable quality is not a concrete quantity that can be measured correctly with a function or an instrument. The quality score is quite subjective, and different testers may apply different standards, which may lead to the bimodal distribution for the highest quality.

Reflection

The white wine data set contains information on 4898 white wines. I started by understanding the individual variables in the data set, and then I explored the relationships among these variables as I continued to make observations on plots. Eventually, I explored the quality of white wine across many variables and created a linear model to predict white wine quality.

I assumed that the differences between adjacent quality levels are equal, so I treated quality as a continuous variable and calculated Pearson’s correlation. I found that quality correlates most strongly with alcohol. Quality also correlates with density, chlorides, total.sulfur.dioxide and volatile.acidity, but these correlations are weaker. I found many correlations between the independent variables: alcohol correlates with density and residual.sugar; density correlates with residual.sugar and total.sulfur.dioxide; total.sulfur.dioxide correlates with free.sulfur.dioxide; pH correlates with fixed.acidity.

The linear model includes alcohol, volatile.acidity, sulphates, residual.sugar and fixed.acidity. This model explains only about 30% of the variance in quality, and alcohol alone explains about 20%. This is probably because treating quality as a continuous variable is not appropriate. There may also be other factors, not included in the data set, that influence quality. A data set that is not large enough may be another reason. I would be interested in performing ordinal regression, which is designed for an ordinal response variable.

I also found a very interesting thing in the plots: quality 9 has a bimodal distribution for most of the features, such as alcohol, fixed.acidity, residual.sugar, free.sulfur.dioxide and volatile.acidity. This surprised me. My guess is that the response variable quality is not a concrete quantity that can be measured correctly and consistently with a function or another objective measurement. The quality score is quite subjective, and different testers may apply different standards, which may lead to the bimodal distribution for the highest quality.

The main struggle in this project was the data type. The example in the lesson has a continuous response variable and various types of independent variables, so there is a lot to explore and many different figures to plot. In this data set the response variable is ordinal, so I did not know what to do when I began. I spent some time learning the difference between categorical, ordinal, interval and ratio variables, and searching for regression methods for these variables. In the end I chose to assume that the difference between adjacent levels is equal and to treat quality as a continuous variable. I was also able to generate different kinds of plots by exploring and referring to some websites. The result might not be good enough, but I have learned a lot through this process.
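As a pointer for the ordinal regression mentioned above, here is a minimal sketch of a proportional-odds model using polr() from the MASS package, which ships with standard R installations. It was not part of this analysis. The data are simulated, and the three labels low, mid and high are illustrative stand-ins for the real quality levels.

```r
library(MASS)   # provides polr()

set.seed(5)
n <- 500
# Simulated stand-in (not real wine data): a latent score cut into ordered levels
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
latent  <- 0.8 * (alcohol - 10.5) + rlogis(n)
quality <- cut(latent, breaks = c(-Inf, -1, 1, Inf),
               labels = c("low", "mid", "high"), ordered_result = TRUE)
toy <- data.frame(alcohol, quality)

# Proportional-odds logistic regression; the response must be an ordered factor
fit <- polr(quality ~ alcohol, data = toy, Hess = TRUE)

coef(fit)                               # log-odds of a higher level per unit of alcohol
head(predict(fit, type = "probs"), 3)   # predicted probability of each level
```

With the real data, the response would be something like factor(wq$quality, ordered = TRUE), which keeps the ordering of the scores without assuming equal spacing between them.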

Reference

  1. https://rpubs.com/Daria/57835
  2. https://onlinecourses.science.psu.edu/stat857/node/223
  3. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.